Add working CI #82

Merged: rohita5l merged 11 commits into main from wip2 on May 22, 2026

rohita5l (Collaborator) commented May 22, 2026

Make CI work end-to-end

The CI workflow on main was never green — both the unit-test job and the e2e job failed for unrelated reasons. This PR fixes everything needed to get both to pass.

Summary of fixes

1. Switched runners (b7c3a4f)

databricks-protected-runner-group does TLS-level SNI filtering that blocks
egress to anything other than files.pythonhosted.org. uv can resolve from
the lockfile (direct file URLs) but anything that hits a /simple/ index —
including build-isolation deps for thrift's sdist — fails with TLS timeouts.
Both jobs now run on ubuntu-latest, matching what databricks/databricks-ai-bridge
uses for its uv run pytest CI.

2. DATABRICKS_BEARER as the CI auth path (0ae6d88)

databricks auth token only returns cached user-OAuth tokens — it doesn't
support M2M client_credentials. On hosted CI there's no ~/.databrickscfg
and no cached login, so the CLI always errored with "OAuth is not configured
for this host," and ucode's token helper hung on the auth login --no-browser
fallback waiting on stdin. Two changes:

  • has_valid_databricks_auth + get_databricks_token now short-circuit to
    a DATABRICKS_BEARER env var when set (the same env var
    build_auth_shell_command already honors for the agents' apiKeyHelper).
  • The e2e workflow pulls DATABRICKS_BEARER from a repo secret. The
    workspace admin generates a PAT with all-apis scope and pastes it in;
    CI uses the same bearer for the SDK helper, the agent runtime helpers,
    and all gateway API calls.
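The short-circuit described above can be sketched roughly as follows. This is a minimal illustration, not the actual ucode code: the names resolve_token and has_bearer and the cli_fallback parameter are hypothetical, and only the DATABRICKS_BEARER variable name comes from this PR.

```python
import os
from typing import Callable, Mapping, Optional


def has_bearer(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when a non-empty DATABRICKS_BEARER is present."""
    return bool(env.get("DATABRICKS_BEARER", "").strip())


def resolve_token(
    env: Mapping[str, str] = os.environ,
    cli_fallback: Optional[Callable[[], str]] = None,
) -> str:
    """Prefer a pre-fetched bearer; only fall back to the CLI when it is unset.

    cli_fallback is a hypothetical stand-in for whatever shells out to the
    Databricks CLI. It is never called when the env var is set, so CI never
    reaches the interactive login path.
    """
    bearer = env.get("DATABRICKS_BEARER", "").strip()
    if bearer:
        return bearer
    if cli_fallback is None:
        raise RuntimeError("DATABRICKS_BEARER is not set and no CLI fallback is available")
    return cli_fallback()
```

Passing the environment in as a mapping keeps the helper easy to test without mutating os.environ.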

3. Install agent CLIs in CI (40c8c6f)

_require_binary("codex") etc. skipped every TestXxxLaunch test when the
agent wasn't on PATH, and _register_web_search_mcp raised FileNotFoundError
when it tried to run claude mcp remove. Adding a single npm install -g step
for all six packages makes the launch tests actually run and lets MCP cleanup
shell out as designed.
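For illustration, a skip-when-missing guard of the kind described above might look like the sketch below. It is an assumption about the shape of _require_binary, not its real body: the names require_binary and missing_binaries are hypothetical, and unittest.SkipTest is used only to keep the example standard-library only.

```python
import shutil
import unittest
from typing import Iterable, List


def require_binary(name: str) -> str:
    """Return the resolved path of `name`, or skip the calling test if absent."""
    path = shutil.which(name)
    if path is None:
        # A skip is silent in a green CI summary, which is how every launch
        # test could be skipped without anyone noticing.
        raise unittest.SkipTest(f"{name} is not on PATH")
    return path


def missing_binaries(names: Iterable[str]) -> List[str]:
    """List the binaries that are not on PATH, preserving input order."""
    return [name for name in names if shutil.which(name) is None]
```

Calling something like missing_binaries(["codex", "claude"]) early in a CI job and failing when the list is non-empty would turn silent skips into a visible error.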

4. Misc test bugs

  • tests/test_cli.py: --help output substring assertion ("--agents" in result.output) failed on CI because FORCE_COLOR=1 makes typer/rich split
    styled tokens across ANSI escapes (--\x1b[0m\x1b[1magents). Strip ANSI
    before checking (4065f14).
  • tests/test_e2e.py: TestConfigureSubset monkeypatched
    ucode.databricks.run_databricks_login, but cli.py does
    from ucode.databricks import run_databricks_login so the local name
    was never stubbed and the real function opened a browser. Patch
    ucode.cli.run_databricks_login instead (835adef).
  • mcp_web_search.py: Bumped the urllib timeout from 60s to 180s. The Codex
    Responses API with native web_search does a real web fetch plus a
    completion, which can easily exceed 60s under load (835adef).
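The monkeypatch issue in the second bullet comes down to patching the name where it is looked up, not where it is defined. The sketch below reproduces it with two throwaway modules; demo_lib, demo_app, run_login and configure are hypothetical stand-ins for ucode.databricks, ucode.cli and their functions.

```python
import sys
import types
from unittest import mock

# demo_lib plays the role of the module that defines the function.
lib = types.ModuleType("demo_lib")
exec("def run_login():\n    return 'real'\n", lib.__dict__)
sys.modules["demo_lib"] = lib

# demo_app plays the role of the module that does `from demo_lib import run_login`,
# which binds its own local name to the function object.
app = types.ModuleType("demo_app")
sys.modules["demo_app"] = app
exec(
    "from demo_lib import run_login\n"
    "\n"
    "def configure():\n"
    "    return run_login()\n",
    app.__dict__,
)

# Patching the defining module leaves demo_app's local name untouched.
with mock.patch("demo_lib.run_login", return_value="stub"):
    patched_at_definition = app.configure()

# Patching the importing module replaces the name configure() actually reads.
with mock.patch("demo_app.run_login", return_value="stub"):
    patched_at_use = app.configure()
```

Here patched_at_definition is still the real return value, while patched_at_use is the stub.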

5. Pre-existing lint failures (6cbc26c)

test_ruff_format and test_ty were already failing on main before this
branch — fixed both:

  • Ran ruff format src/.
  • Typed the run() wrapper with @overload so text=True narrows to
    CompletedProcess[str]. Guard _scrub_json's dict-key match against
    non-str keys.
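A rough sketch of both typing fixes, under stated assumptions: the signatures of the real run() wrapper and _scrub_json are not shown in this PR, so the run and scrub functions and the SENSITIVE_KEY pattern below are hypothetical approximations of the technique only.

```python
import re
import subprocess
import sys
from typing import Any, Literal, Sequence, overload


@overload
def run(cmd: Sequence[str], *, text: Literal[True]) -> "subprocess.CompletedProcess[str]": ...
@overload
def run(cmd: Sequence[str], *, text: Literal[False] = ...) -> "subprocess.CompletedProcess[bytes]": ...
def run(cmd: Sequence[str], *, text: bool = False) -> "subprocess.CompletedProcess[Any]":
    """Wrap subprocess.run; the overloads let a type checker narrow stdout by `text`."""
    return subprocess.run(cmd, capture_output=True, text=text, check=False)


# Hypothetical pattern for keys whose values should be masked.
SENSITIVE_KEY = re.compile(r"token|secret|bearer", re.IGNORECASE)


def scrub(value: Any) -> Any:
    """Mask values under sensitive keys, recursing through dicts and lists."""
    if isinstance(value, dict):
        return {
            # The isinstance guard keeps non-str keys (ints, tuples) away from
            # Pattern[str].search, which only accepts str.
            key: "***" if isinstance(key, str) and SENSITIVE_KEY.search(key) else scrub(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [scrub(item) for item in value]
    return value


greeting = run([sys.executable, "-c", "print('hi')"], text=True).stdout.strip()
```

With the overloads in place, a checker sees greeting as str without a cast, and callers that omit text get bytes.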

Required repo configuration

After merging, the following need to be set in repo Settings:

| Type     | Name                 | Value                    |
|----------|----------------------|--------------------------|
| Variable | E2E_ENABLED          | true (gates the e2e job) |
| Secret   | UCODE_TEST_WORKSPACE | workspace URL            |
| Secret   | DATABRICKS_BEARER    | PAT with all-apis scope  |

DATABRICKS_BEARER is short-lived; rotate when CI starts failing with 401s.
A future improvement could exchange DATABRICKS_CLIENT_ID/SECRET at job
start to keep the bearer always fresh.
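The future improvement mentioned above could start from something like the sketch below. It only builds the request and does not send it. This PR names just the client_credentials grant and the /oidc/v1/token path; the Basic auth header, the form fields, the all-apis scope default and the function name build_token_request are assumptions that should be checked against the Databricks OAuth M2M documentation before use.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(
    host: str,
    client_id: str,
    client_secret: str,
    scope: str = "all-apis",  # assumed scope, not confirmed by this PR
) -> urllib.request.Request:
    """Build (but do not send) a client_credentials token request."""
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": scope}
    ).encode("ascii")
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{host.rstrip('/')}/oidc/v1/token",
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

A job step could send this with urllib.request.urlopen, read access_token from the JSON response, and export it as DATABRICKS_BEARER for the rest of the run.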

Test plan

  • test job: 535 unit tests pass on ubuntu-latest
  • e2e job: all 25 e2e tests pass (including the six agent launch tests
    exercising real binaries against the workspace's AI Gateway endpoints)
  • No browser auth ever triggered from CI

The hostname is pypi-proxy.cloud.databricks.com (hyphen), not
pypi.proxy.cloud.databricks.com — the dotted form doesn't resolve.
@rohita5l rohita5l requested a review from AarushiShah-db as a code owner May 22, 2026 20:57
rohita5l added 10 commits May 22, 2026 17:21
Confirms whether the runner can reach files.pythonhosted.org,
pypi-proxy.cloud.databricks.com, and pypi.org. Earlier runs suggested
the runner can reach pythonhosted directly (wheel downloads succeed)
but can't reach the proxy or pypi.org for /simple/ lookups. Run this
to get hard evidence before talking to the runner-group owners.

Errors are tolerated so we see all three results before pytest runs.

The databricks-protected-runner-group's egress firewall blocks TLS to
pypi.org and pypi-proxy.cloud.databricks.com (only files.pythonhosted.org
is reachable for TLS handshakes), which prevents uv from resolving
build-isolation deps like setuptools to compile thrift's sdist.

Other Databricks Python repos (e.g. databricks-ai-bridge) run their
pytest CI on plain ubuntu-latest with no proxy and it works fine.
Switch both jobs to match. The test job has no need for protected
egress, and e2e only uses secrets passed via env, which work the
same on either runner.

Also removes the now-unused UV_INDEX_URL env var and the temporary
network diagnostic step.

CI runners (e.g. GitHub Actions ubuntu-latest) set FORCE_COLOR=1, which
makes rich/typer render --help with SGR escapes that split styled tokens
across ANSI codes. ``"--agents" in result.output`` then fails because the
rendered output is actually ``--\x1b[0m\x1b[1magents``. The test passed
locally because non-TTY runs don't get colored.

Strip ANSI before checking so the assertion holds either way.
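A minimal sketch of the stripping step, assuming a plain regex is enough for this output. The helper name strip_ansi and the sample string are illustrative; the escape sequence itself is the one quoted above.

```python
import re

# Matches CSI escape sequences, which include the SGR color/style codes
# (for example "\x1b[0m" and "\x1b[1m") that rich emits.
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi(text: str) -> str:
    """Remove ANSI CSI escape sequences from rendered terminal output."""
    return ANSI_CSI.sub("", text)


# Simulated --help fragment where the styled token is split across escapes.
colored = "  --\x1b[0m\x1b[1magents\x1b[0m  TEXT"
found_raw = "--agents" in colored
found_clean = "--agents" in strip_ansi(colored)
```

The raw check misses the flag and the cleaned check finds it, which matches the CI-only failure.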

- ruff format: auto-formatted src/ucode/databricks.py (collapsed lines
  ruff now considers fitting under the 100-char limit).
- ty: typed the run() wrapper with @overload so text=True narrows to
  CompletedProcess[str] (every caller that reads the return value uses
  text=True). Also guard _scrub_json's dict-key match against non-str
  keys so re.Pattern[str].search gets a str.

Test suite stays at 535 passing.

The prior commit ran `git add -A` and swept up local-only scratch files
(.antigravitycli/, .claude/, .vscode/, OPENCODE_PLAN.md, mlflow.db,
scripts/) that should never have been tracked. Removing them from the
index; they remain on disk locally.

The Databricks CLI's `auth login --no-browser` fallback path reads stdin
for the user to paste an OAuth code. In CI the runner's stdin sometimes
keeps the subprocess alive past Python's timeout, hanging the whole job
for the full ~6h workflow limit. Redirecting stdin to /dev/null makes
that fallback EOF immediately, so if M2M env vars don't yield a token
we fail fast with a clear error instead of hanging.
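The effect of the stdin redirect can be shown with a small stand-in child process. In this sketch run_noninteractive is a hypothetical helper name and the child is a Python one-liner, not the Databricks CLI; it simply tries to read a pasted value the way the fallback would.

```python
import subprocess
import sys
from typing import Sequence


def run_noninteractive(cmd: Sequence[str], timeout: float = 30.0) -> "subprocess.CompletedProcess[str]":
    """Run a command with stdin attached to the null device.

    Any read from stdin in the child returns EOF at once, so a prompt that
    would otherwise wait for input ends immediately.
    """
    return subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )


# Stand-in for a CLI that waits for a pasted code on stdin.
child = [sys.executable, "-c", "import sys; print(repr(sys.stdin.read()))"]
result = run_noninteractive(child)
```

The child prints an empty string and exits with code 0 without waiting for input.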

Last CI run failed with `Databricks CLI returned no access token` after
get_databricks_token's pytest fixture couldn't authenticate via M2M env
vars. The CLI's stderr is captured (and discarded) by ucode so we never
see *why* M2M failed. This diagnostic step prints the CLI version, which
auth env vars are set, whether ~/.databrickscfg exists, and the raw
output of `databricks auth token` + `current-user me` so the next run
will tell us exactly what's wrong.

`databricks auth token` only returns cached user-OAuth tokens (from
`databricks auth login`) — it has no M2M path. Hosted-runner CI has no
~/.databrickscfg and no cached login, so the CLI command always errors
with "OAuth is not configured for this host", and ucode's token helper
hangs on the `auth login --no-browser` fallback waiting for stdin.

Teach has_valid_databricks_auth and get_databricks_token to short-circuit
to a pre-fetched DATABRICKS_BEARER env var (the same env var
build_auth_shell_command already honors). The e2e workflow pulls it from
a repo secret; user fetches a fresh bearer (e.g. M2M client_credentials
against /oidc/v1/token) and pastes it into Settings → Secrets when the
token expires (~1h).

Removes DATABRICKS_CLIENT_ID/_SECRET from the workflow — they're no longer
load-bearing now that we don't call the CLI for auth.

- tests/test_e2e.py: TestConfigureSubset patches `run_databricks_login`
  at the wrong module path. cli.py does `from ucode.databricks import
  run_databricks_login`, so patching `ucode.databricks.run_databricks_login`
  doesn't affect the local name cli uses, and the real function runs
  (which opens a browser). Patch `ucode.cli.run_databricks_login` instead.
- mcp_web_search.py: bump urllib timeout from 60s to 180s. Codex
  Responses API with native web_search does a real web fetch + LLM
  completion, which legitimately can exceed 60s under load.

The six TestXxxLaunch tests all call `_require_binary("...")` and skip
when the agent isn't on PATH. The runner has only Node.js, not the
agents themselves, so every launch test skipped. Install all six
via `npm install -g` so they exercise real binaries against the e2e
workspace.
@rohita5l rohita5l changed the title Fix Databricks pypi proxy hostname Add working CI May 22, 2026
@rohita5l rohita5l merged commit ed316e0 into main May 22, 2026
2 checks passed
@rohita5l rohita5l deleted the wip2 branch May 22, 2026 23:04